Dirty Data Project

Dataset 2 : Cake Ingredients

Here is my report about my first dataset.

Cleaning notes

The datasets I used were cake-ingredients-1961.csv and cake_ingredient_code.csv. To clean them I did the following:

library(tidyverse)
library(janitor)
cakes <- read_csv("raw_data/cake-ingredients-1961.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Cake = col_character(),
##   BM = col_logical()
## )
## See spec(...) for full column specifications.
codes <- read_csv("raw_data/cake_ingredient_code.csv")
## Parsed with column specification:
## cols(
##   code = col_character(),
##   ingredient = col_character(),
##   measure = col_character()
## )
  1. Pivoted the cake data into long format
  2. Left-joined the ingredient codes, then dropped the rows with NA amounts
  3. Used the janitor package to tidy up the variable names

Here is my beautiful pipe:

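cakes <-
  cakes %>%
  pivot_longer(-Cake, names_to = "code", values_to = "amount") %>%  # one row per cake/code pair
  left_join(codes) %>%     # attach ingredient name and measure for each code
  drop_na(amount) %>%      # keep only ingredients a cake actually uses
  select(-code) %>%        # the code is redundant once the names are joined
  clean_names()            # janitor: Cake -> cake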
## Joining, by = "code"
cakes
## # A tibble: 156 x 4
##    cake          amount ingredient       measure   
##    <chr>          <dbl> <chr>            <chr>     
##  1 Angel           0.25 Almond essence   teaspoon  
##  2 Angel           1.25 Cream of tartar  teaspoon  
##  3 Angel          10    Egg white        one       
##  4 Angel           1    Sifted flour     cup       
##  5 Angel           1.5  Granulated sugar cup       
##  6 Angel           0.25 Salt             teaspoon  
##  7 Angel           1    Vanilla extract  teaspoon  
##  8 Babas au Rhum   0.25 Butter           cup       
##  9 Babas au Rhum   3    Dried currants   tablespoon
## 10 Babas au Rhum   1    Eggs             one       
## # … with 146 more rows
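
To make the pivot and join steps easier to follow, here is a minimal sketch on made-up data. The tables toy_cakes and toy_codes, and the codes AE and EG, are invented for illustration and are not taken from the real files. It also passes by = "code" explicitly, which is an optional tweak that should silence the "Joining, by" message.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per cake, one column per ingredient code.
toy_cakes <- tibble(
  Cake = c("Angel", "Sponge"),
  AE   = c(0.25, NA),
  EG   = c(NA, 3)
)

# Hypothetical lookup table: code -> ingredient name and unit of measure.
toy_codes <- tibble(
  code       = c("AE", "EG"),
  ingredient = c("Almond essence", "Eggs"),
  measure    = c("teaspoon", "one")
)

toy_long <- toy_cakes %>%
  # 2 cakes x 2 code columns become 4 rows of (Cake, code, amount)
  pivot_longer(-Cake, names_to = "code", values_to = "amount") %>%
  # naming the key explicitly avoids the "Joining, by" message
  left_join(toy_codes, by = "code") %>%
  # NA amount means the cake does not use that ingredient
  drop_na(amount) %>%
  select(-code)

toy_long
```

Each wide column turns into a row, so the NA cells become rows that drop_na() can remove.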

Which cake has the most cocoa in it?

cocoa_max <- cakes %>% filter(ingredient == "Cocoa") %>%
  arrange(desc(amount)) %>%
  head(1)
cocoa_max
## # A tibble: 1 x 4
##   cake               amount ingredient measure   
##   <chr>               <dbl> <chr>      <chr>     
## 1 One Bowl Chocolate     10 Cocoa      tablespoon
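
One thing to watch with arrange() plus head(1) is that it returns a single row even if two cakes tie for the top amount. A possible alternative is slice_max(), sketched here on made-up data (the toy tibble is invented, and this assumes dplyr 1.0.0 or later, where slice_max() is available):

```r
library(dplyr)

# Hypothetical data where cakes B and C tie for the most cocoa.
toy <- tibble(
  cake       = c("A", "B", "C"),
  amount     = c(2, 10, 10),
  ingredient = "Cocoa"
)

# arrange + head keeps only the first of the tied rows
top_first <- toy %>%
  arrange(desc(amount)) %>%
  head(1)

# slice_max keeps every row that ties for the maximum (with_ties = TRUE by default)
top_ties <- toy %>%
  slice_max(amount, n = 1)
```

Here top_first has one row and top_ties has two, so slice_max() makes a tie visible instead of hiding it.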

For sponge cake, how many cups of ingredients are used in total?

sponge_cups <- cakes %>%
  filter(cake == "Sponge") %>%
  filter(measure == "cup") %>%
  summarise(sponge_cups = sum(amount))

I got stuck on this question, then I remembered the summarise function.
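
As a small illustration of the filter-then-summarise pattern, here is a sketch on made-up data. The toy tibble and the name total_cups are invented, and the numbers are not the real sponge cake amounts.

```r
library(dplyr)

# Hypothetical long-format data in the same shape as the cleaned cakes table.
toy <- tibble(
  cake    = c("Sponge", "Sponge", "Sponge", "Angel"),
  amount  = c(1, 0.5, 2, 1.5),
  measure = c("cup", "cup", "teaspoon", "cup")
)

toy_cups <- toy %>%
  # both conditions can go in one filter() call; they are combined with AND
  filter(cake == "Sponge", measure == "cup") %>%
  # collapse the remaining rows into a single total
  summarise(total_cups = sum(amount)) %>%
  # pull() returns the number itself rather than a one-cell tibble
  pull(total_cups)

toy_cups
```

Only the two Sponge rows measured in cups are summed, so the teaspoon row and the Angel row do not contribute.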

How many ingredients are measured in teaspoons?

teaspoons <- cakes %>% filter(measure == "teaspoon") %>%
            nrow()
teaspoons
## [1] 45
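
Note that nrow() counts cake/ingredient rows, so an ingredient used by several cakes is counted once per cake. If the question is read as asking for distinct ingredients, n_distinct() might be closer to what is wanted. A minimal sketch on made-up data (the toy tibble is invented, so the numbers below say nothing about the real answer):

```r
library(dplyr)

# Hypothetical data: Salt appears in two cakes, Vanilla extract in one.
toy <- tibble(
  cake       = c("A", "B", "B"),
  ingredient = c("Salt", "Salt", "Vanilla extract"),
  measure    = "teaspoon"
)

# counts rows, so Salt is counted twice
teaspoon_rows <- toy %>%
  filter(measure == "teaspoon") %>%
  nrow()

# counts each ingredient once, however many cakes use it
teaspoon_ingredients <- toy %>%
  filter(measure == "teaspoon") %>%
  summarise(n = n_distinct(ingredient)) %>%
  pull(n)
```

On this toy data the row count is 3 while the distinct count is 2.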

Which cake has the most unique ingredients?

n_ingredients <- cakes %>%
  group_by(cake) %>%
  count(n_distinct(ingredient)) %>%
  arrange(desc(n)) %>%
  head(1)

n_ingredients
## # A tibble: 1 x 3
## # Groups:   cake [1]
##   cake          `n_distinct(ingredient)`     n
##   <chr>                            <int> <int>
## 1 Babas au Rhum                       11    11
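
The count() call above sorts on n, which is the number of rows per cake, and that only matches the number of unique ingredients when no ingredient is repeated within a cake. A possibly more direct way is to summarise with n_distinct() and sort on that, sketched here on made-up data (the toy tibble and the column name n_unique are invented for illustration):

```r
library(dplyr)

# Hypothetical data: cake A has 4 rows but only 2 distinct ingredients,
# cake B has 3 rows and 3 distinct ingredients.
toy <- tibble(
  cake       = c("A", "A", "A", "A", "B", "B", "B"),
  ingredient = c("Eggs", "Eggs", "Salt", "Eggs", "Eggs", "Butter", "Salt")
)

most_varied <- toy %>%
  group_by(cake) %>%
  # one row per cake, holding its number of distinct ingredients
  summarise(n_unique = n_distinct(ingredient)) %>%
  arrange(desc(n_unique)) %>%
  head(1)

most_varied
```

Sorting by row count would pick cake A here, while sorting by n_unique picks cake B.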

Caffeine consumption throughout this project

Day 1                 Day 2
1 full French Press   1 French Press
4 Yorkshire Gold      3 Yorkshire Gold

1 Pot Peppermint Tea

I found this cheatsheet useful